What if we didn't know what section the articles were in?



Word Distribution



Documents are represented as word distributions (word counts).



Topics: Independent Word Distributions



LDA finds independent word distributions that the documents are related to.

Documents can be associated with more than one topic.
Documents are allocated to topics, and a proportion of each document's words is allocated to each topic. Because it is an allocation, topics compete for a limited budget of words: you can't have two topics each allocated at 100% to one document.
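
The allocation idea can be sketched in a few lines. This is a minimal illustration, assuming each word in a document has already been assigned to exactly one topic; the function name `topic_proportions` and the topic labels are hypothetical and not part of the tutorial code.

```python
from collections import Counter


def topic_proportions(word_topic_assignments):
    """Turn per-word topic assignments into per-document topic proportions."""
    counts = Counter(word_topic_assignments)
    total = len(word_topic_assignments)
    # Every word counts once, so the proportions always share a budget of 1.0.
    return {topic: n / total for topic, n in counts.items()}


# One document with four words, three assigned to one topic and one to another.
proportions = topic_proportions(["sports", "sports", "entertainment", "sports"])
```

Because the proportions must sum to one, raising one topic's share of a document necessarily lowers the others.
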
Here are two topics:

* gambling

* play

* night life

* comedy

* movie

* quarter

* opponent

* theatre



These word lists look like: Sports and Entertainment!



MySQL 3.23 Case Study



This plot was created from MySQL changelog topics that could be easily named.



Data



• Choose:

- Source Code

- Natural Language

• You can mix the two, but you're going to bias the topics toward one language or the other.

• Try to stick to 1 natural language. If you have a primarily English project, the German contributors will be noticeable.

• Need to tokenize/split words

• Blei does not recommend n-grams, but you don't need to listen to him. He just made LDA, that's all.
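
Splitting words and building n-grams can be sketched as follows. This is only an illustration, assuming identifiers use snake_case or camelCase; the helper names `split_identifier` and `ngrams` are hypothetical and do not come from the tutorial repository.

```python
import re


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words."""
    # Insert a space at each lowercase-to-uppercase boundary, then split on "_".
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return [part.lower() for part in spaced.split()]


def ngrams(tokens, n=2):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = split_identifier("github_issues_toJson")
bigrams = ngrams(["border", "radius", "variables"])
```

Whether n-grams help is a judgment call: they keep phrases such as "border radius" together, but they also grow the vocabulary.
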



Data: Issue Trackers



[Screenshot: GitHub issue list]

• #14520 [Customizer] Feature Request Live Preview of Customisation (opened 24 days ago by WebsiteDeveloper)

• #14517 Sample (opened 25 days ago by classbook)

• #14514 LESS should support mixins defined as .col-@{class}-@{index} like before (opened 25 days ago by allenwlee)

• #14513 No way to get Tooltip instance (label: js; opened 25 days ago by tyrsius)

• #14512 IE9 crashing in contentEditable mode due to :empty selector (opened 25 days ago by gdelhumeau)

• #14511 Missing border radius variables for small and large inputs (milestone: v3.2.1; opened 26 days ago by hoho)

• #14510 has-feedback for RTL languages (labels: css, feature; opened 26 days ago by idleberg)

• Disabled fieldsets don't disable input fields on IE (11) (labels: css, docs)



Data: Issue Trackers



[Screenshot: a single GitHub issue]

LESS should support mixins defined as .col-@{class}-@{index} like before #14514

Closed. allenwlee opened this issue 25 days ago. 2 comments.

allenwlee commented 25 days ago:

the following will now fail due to .col-xs-12

    @import 'twitter/bootstrap';
    .test {
      .text-center;
      .text-uppercase;
      .col-xs-12;
    }

here is a test app: https://github.com/allenwlee/test-less-rails-bootstrap



Data: Issue Trackers



https://api.github.com/repos/twbs/bootstrap/issues/14514

{
  "url": "https://api.github.com/repos/twbs/bootstrap/issues/14514",
  "labels_url": "https://api.github.com/repos/twbs/bootstrap/issues/14514/labels{/name}",
  "comments_url": "https://api.github.com/repos/twbs/bootstrap/issues/14514/comments",
  "events_url": "https://api.github.com/repos/twbs/bootstrap/issues/14514/events",
  "html_url": "https://github.com/twbs/bootstrap/issues/14514",
  "id": 41743792,
  "number": 14514,
  "title": "LESS should support mixins defined as .col-@{class}-@{index} like before",
  "user": {
    "login": "allenwlee",
    "id": 1839288,
    "avatar_url": "https://avatars.githubusercontent.com/u/1839288?v=2",
    "gravatar_id": "",
    "url": "https://api.github.com/users/allenwlee",
    "html_url": "https://github.com/allenwlee",
    "followers_url": "https://api.github.com/users/allenwlee/followers",
    "following_url": "https://api.github.com/users/allenwlee/following{/other_user}",
    "gists_url": "https://api.github.com/users/allenwlee/gists{/gist_id}",
    "starred_url": "https://api.github.com/users/allenwlee/starred{/owner}{/repo}",
    "subscriptions_url": "https://api.github.com/users/allenwlee/subscriptions",
    "organizations_url": "https://api.github.com/users/allenwlee/orgs",
    "repos_url": "https://api.github.com/users/allenwlee/repos",
    "events_url": "https://api.github.com/users/allenwlee/events{/privacy}",
    ...
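
One way to turn an issue record like this into a text document for LDA is sketched below. It assumes the document text is just the issue title plus its body; the function `issue_to_document` and the output keys `id` and `text` are hypothetical, and the tutorial's own extractor may combine fields differently.

```python
import json


def issue_to_document(issue):
    """Combine the title and body of one issue into a single text document."""
    title = issue.get("title") or ""
    body = issue.get("body") or ""  # the body field can be null in API output
    return {"id": issue["number"], "text": (title + "\n" + body).strip()}


# A cut-down record with only the fields used here.
sample = json.loads(
    '{"number": 14514, "title": "LESS should support mixins", "body": null}'
)
document = issue_to_document(sample)
```

Comments live behind `comments_url` and would need a separate request if they should be part of the document.
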



Octokit Issue Extractor 



• Let's go look at github_issues_toJson.rb

• Uses the GitHub API

• Has to query multiple pages

• Needs config.json filled out with a real GitHub username and password

• https://bitbucket.org/abram/lda-chapter-tutorial



Go look at the code! 
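
The "query multiple pages" step can be sketched in Python as well. This is not the tutorial's Ruby script: it only illustrates following the `rel="next"` entry of the HTTP `Link` header that GitHub returns for paginated results. The names `next_page_url` and `fetch_all_issues` are hypothetical, and `fetch_page` is an assumed callable that performs the actual HTTP request.

```python
import re


def next_page_url(link_header):
    """Return the rel="next" URL from an HTTP Link header, or None."""
    if not link_header:
        return None
    for part in link_header.split(","):
        match = re.match(r'\s*<([^>]+)>\s*;\s*rel="next"', part)
        if match:
            return match.group(1)
    return None


def fetch_all_issues(first_url, fetch_page):
    """Collect issues from every page.

    fetch_page(url) must return (list_of_issues, link_header_or_None).
    """
    issues, url = [], first_url
    while url:
        page, link_header = fetch_page(url)
        issues.extend(page)
        url = next_page_url(link_header)
    return issues
```

Passing the fetcher in as a function keeps the paging logic testable without network access or credentials.
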



Issue Example 



• Go and look at mirror-gh.sh 

• Go and look at github_issues_toJson.rb 

• Go and look at data/*/large.json 



Pre-processing 



• Loading text 

• Mapping text into final textual representation 

• Lexical analysis of the text 

• Optionally removing stop words 

• Optionally stemming 

• Building a vocabulary 

• Optionally removing uncommon or very common words

• Mapping each text document into a word-bag 
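
The steps above can be sketched end to end with the standard library. This is a simplified illustration, not the tutorial's lda.py: the stop word list is a tiny placeholder, stemming is left out, and every function name here is hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stop word list; a real run would use a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "on"}


def tokenize(text):
    """Lexical analysis: lowercase the text and keep alphanumeric runs."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(token_lists, min_docs=1, max_doc_fraction=1.0):
    """Map each kept word to an integer id, filtering by document frequency."""
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    n_docs = len(token_lists)
    kept = sorted(
        word for word, df in doc_freq.items()
        if df >= min_docs and df / n_docs <= max_doc_fraction
    )
    return {word: idx for idx, word in enumerate(kept)}


def to_word_bag(tokens, vocabulary):
    """Represent one document as sorted (word_id, count) pairs."""
    counts = Counter(vocabulary[t] for t in tokens if t in vocabulary)
    return sorted(counts.items())


def preprocess(documents, **vocab_options):
    """Tokenize, drop stop words, build the vocabulary, and make word-bags."""
    token_lists = [
        [t for t in tokenize(doc) if t not in STOP_WORDS] for doc in documents
    ]
    vocabulary = build_vocabulary(token_lists, **vocab_options)
    return vocabulary, [to_word_bag(tokens, vocabulary) for tokens in token_lists]
```

For example, `preprocess(["IE9 crashing in contentEditable mode", "Tooltip crashing on IE9"])` yields a five-word vocabulary and one word-bag per issue title.
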



Example Preprocessing 



From lda.py: 



def tokenize(text, tokenizer=_tokenizer):
    tokens = filter_stopwords(
        tokenizer.tokenize(text.lower()))
    return tokens



[Diagram: LDA inputs and outputs]

Inputs: Documents, # Topics, Alpha, Beta, Iterations

Outputs: Document-Topic Matrix, Word-Topic Matrix



Alpha and Beta hyperparameters 



• Actually vectors of parameters

• Most people use a constant setting

• A rule of thumb:
- < 1/(number of topics)

• β is for topics: specific topics or not

• α is for documents: associated with few or many topics

• Larger values of β lead to broad topics and smaller values of β lead to narrow topics

• If α is near 1, we expect to see documents with few topics and documents with many topics in equal proportion.

• If α is less than one, we expect most documents to only use a few topics.

• If α is greater than one, we expect most documents to use almost every topic.



In the demo: K Topics = 20, α = 0.01, β = 0.01
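
The effect of α can be checked with a small simulation. This sketch assumes a symmetric Dirichlet prior over 20 topics and counts how many topics get more than 5% of a sampled document. The function names, the 5% threshold, and the α values tried are illustrative choices, not settings from the tutorial.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics."""
    while True:
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
        total = sum(draws)
        if total > 0.0:  # retry in the rare case every draw underflows to zero
            return [d / total for d in draws]


def mean_active_topics(alpha, k=20, threshold=0.05, samples=500, seed=0):
    """Average number of topics holding more than `threshold` of a document."""
    rng = random.Random(seed)
    active = 0
    for _ in range(samples):
        mixture = sample_dirichlet(alpha, k, rng)
        active += sum(1 for weight in mixture if weight > threshold)
    return active / samples


# Compare a small, a neutral, and a large alpha.
results = {alpha: mean_active_topics(alpha) for alpha in (0.1, 1.0, 10.0)}
```

Smaller α concentrates each sampled document on fewer topics, which matches the bullets above.
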



Parameter Tuning? 



• Increasing the number of topics increases memory use

- But increasing the number of topics will often make you miss topics

• Joshua Campbell says use 

- Mallet or 

- Blei's C implementation 



Run it! 



• Run on existing data:

python lda_fromJson.py --file \
    data/bootstrap/large.json --passes 10 \
    --alpha 0.01 --beta 0.01 --topics 20

• Or

bash project.sh bootstrap



Outputs! 



• summary.json 

- JSON summary of the top words for each extracted topic, ranked by weight.

• document_topic_map.json

- Each document ID mapped to that document's row of the document-topic matrix

• document_topic_map.csv 

- unnormalized topic weights 

• document_topic_map_norm.csv 

- Normalized topic weights 
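
The difference between the two CSV files can be sketched as a row normalization. This assumes "normalized" means dividing each document's topic weights by their sum, which is the usual convention but is not confirmed by the slides; `normalize_rows` is a hypothetical helper.

```python
def normalize_rows(rows):
    """Scale each row of topic weights so that it sums to 1.0."""
    normalized = []
    for row in rows:
        total = sum(row)
        # Leave all-zero rows unchanged instead of dividing by zero.
        normalized.append([w / total for w in row] if total else list(row))
    return normalized


weights = normalize_rows([[2.0, 6.0], [0.0, 0.0]])
```

Normalized rows make documents of different lengths comparable in a spreadsheet.
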



Spreadsheet example... 



Let's load the document_topic_map_norm.csv file into LibreOffice



Data 



• Image: http://dub.softwareprocess.es/2014/LDA-Tutorial.ova

• Repo: https://bitbucket.org/abram/lda-chapter-tutorial/



